+++
title = 'Models for sequential data'
template = 'page-math.html'
+++
# Models for sequential data

## Sequences
They consist of numbers or symbols:
- numeric, 1-dimensional (e.g. stock price over time); can also be n-dimensional
- symbolic (categorical), 1-dimensional, like an English sequence of words/characters; can also be n-dimensional, with multiple categorical features per timestamp (like sheet music)

We could have one sequence per instance, and try to classify the sequences (like email spam/not spam),
or the whole dataset is one sequence, and the instances are ordered.

Single-sequence feature extraction:
- make it a regression problem: each point is represented by the m values before it (see the sketch below)
- this gives us a table with a target label (the value at time t) and m features (the m preceding values)
- you could also use mean/variance statistics as features
- but watch out: if the data is shuffled, the classifier is trained on data that comes from the future (relative to the test data)
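
A quick Python sketch of the regression setup above (my own example, not from the lecture): each row of the table is the m preceding values as features, plus the value at time t as the target.

```python
# Turn a single 1-D sequence into a table of (m lag features, target) rows.
def make_lag_table(series, m):
    """Each row: the m values preceding time t as features, the value at t as target."""
    rows = []
    for t in range(m, len(series)):
        features = series[t - m:t]   # the m preceding values
        target = series[t]           # the value to predict
        rows.append((features, target))
    return rows

# Made-up stock prices, just to show the shape of the resulting table.
prices = [101.0, 102.5, 101.8, 103.2, 104.0, 103.5, 105.1]
for features, target in make_lag_table(prices, m=3):
    print(features, "->", target)
```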

Major key: think about the real-world use case. E.g. if we want to predict future values, the training data shouldn't contain anything that happens later than the test data.

You can do walk-forward validation if the target labels have a meaningful ordering in time:

![519e53c90f0dae96942ac72ed59aacc0.png](9e5ba66c8e834fc383006a31ff012558.png)
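
A rough sketch of walk-forward validation in Python (function and parameter names are mine): the training set always ends strictly before the test set starts, and the cutoff moves forward.

```python
# Walk-forward validation: the model is only ever trained on data that
# comes before the data it is tested on.
def walk_forward_splits(n, n_test, n_min_train):
    """Yield (train_indices, test_indices) pairs that move forward in time."""
    cutoff = n_min_train
    while cutoff + n_test <= n:
        train_idx = list(range(0, cutoff))               # everything before the cutoff
        test_idx = list(range(cutoff, cutoff + n_test))  # the next n_test points
        yield train_idx, test_idx
        cutoff += n_test                                 # move the cutoff forward

for train_idx, test_idx in walk_forward_splits(n=10, n_test=2, n_min_train=4):
    print("train:", train_idx, "test:", test_idx)
```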

When modelling probability, break the sequence into its tokens, like words in a sentence. Each token is modeled as a random variable (_not_ independent).
So you end up with a joint distribution P(W₄, W₃, W₂, W₁) (with some arbitrary number of parameters).

Can apply the chain rule of probability:

$\begin{aligned}
P(W_4, W_3, W_2, W_1) &= P(W_4, W_3, W_2 | W_1) P(W_1) \\
                      &= P(W_4, W_3 | W_2, W_1) P(W_2 | W_1) P(W_1) \\
                      &= P(W_4 | W_3, W_2, W_1) P(W_3 | W_2, W_1) P(W_2 | W_1) P(W_1)
\end{aligned}$

I.e.: we can rewrite the probability of a sentence as a product of the probability of each word, conditioned on its history.
With log probability, you get a sum: $\log{P(\text{sentence})} = \sum_{\text{word}} \log{P(\text{word} | \text{words before it})}$
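
Toy check of the log trick (the conditional probabilities are made up): the sum of log-conditionals equals the log of their product.

```python
import math

# Made-up conditionals: P(w1), P(w2|w1), P(w3|w2,w1), P(w4|w3,w2,w1)
conditionals = [0.1, 0.4, 0.5, 0.2]

product = 1.0
for p in conditionals:
    product *= p

print(math.log(product))                       # log of the product
print(sum(math.log(p) for p in conditionals))  # sum of the logs: same number
```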

## Markov models
Markov assumption: limit the amount of memory for previous tokens, e.g. retain at most 2 words.
The "order" is the number of words retained in the conditional.

For example, if the conditional is $P(x | \text{i, will, graduate, in, a, decade})$ and it's a third-order model, the Markov assumption gives $P(x | \text{in, a, decade})$ (keeping only the three most recent words).

With the Markov assumption and the chain rule, we can model a sequence as limited-memory conditional probabilities. These can be estimated from a corpus (a huge piece of text).

For example, to estimate the probability of the word 'prize' given "won a", count how often "won a prize" occurs in the text as a proportion of the total occurrences of "won a":

$P(\text{prize} | \text{a, won}) \approx \frac{\text{\# won a prize}}{\text{\# won a}}$

The word snippets are "n-grams". Three words is a trigram, two words is a bigram. I guess one word is just a gram. And maybe 1000 words would be a kilogram.
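
Counting this out in Python on a tiny made-up corpus, just to show the mechanics of the estimate above:

```python
# Estimate P(prize | won a) as #("won a prize") / #("won a").
corpus = "she won a prize and he won a medal after they won a prize".split()

def count_ngram(tokens, ngram):
    n = len(ngram)
    return sum(1 for i in range(len(tokens) - n + 1) if tuple(tokens[i:i + n]) == ngram)

won_a = count_ngram(corpus, ("won", "a"))
won_a_prize = count_ngram(corpus, ("won", "a", "prize"))
print(won_a_prize / won_a)   # here 2/3
```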

Sequential sampling: start with a small seed of words, then repeatedly sample the next word according to its probability given the previous words.
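
A minimal sketch of sequential sampling with a bigram (order-1) model, on a made-up corpus; each next word is drawn according to the count-based probability given only the previous word.

```python
import random
from collections import defaultdict

# Bigram counts from a tiny made-up corpus: sampling uniformly from the list
# of followers is the same as sampling from the count-based P(next | previous).
corpus = "the cat sat on the mat and the dog sat on the rug".split()
followers = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    followers[prev].append(nxt)

word = "the"             # small seed
sampled = [word]
for _ in range(8):
    options = followers[word]
    if not options:      # dead end: this word never has a successor in the corpus
        break
    word = random.choice(options)
    sampled.append(word)
print(" ".join(sampled))
```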

## Embedding models
Model an object x by an embedding vector e<sub>x</sub>. Similarities between these vectors represent similarities between the objects.
For words, this creates embedding vectors where distances and directions reflect semantic meaning.

Distributional hypothesis: words that often occur in the same context have similar meanings.

One-hot vector: represent each word as an atomic object, a vector that is all zeros except for a single 1 at the word's index.
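
For example (with a tiny made-up vocabulary):

```python
import numpy as np

vocab = ["cat", "dog", "prize"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0    # all zeros except a single 1 at the word's index
    return v

print(one_hot("dog"))       # [0. 1. 0.]
```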

Word2Vec:
- slide a context window over the sequence, trying to predict the distribution P(y|x): which words are likely to occur in the context window given the middle word
- create a dataset of word pairs from the text (see the sketch below)
- feed this dataset to a two-layer network, which predicts the context
- a softmax activation over 10k outputs (one per word in the vocabulary) is expensive, so some tricks are needed to make it feasible
- after training, discard the second layer (softmax) and only use the embeddings produced by the first layer
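
A sketch of the word-pair step (my own, with a made-up sentence and window size): slide a window over the text and pair each middle word with the words around it.

```python
# Create (middle word, context word) training pairs with a context window.
corpus = "the quick brown fox jumps over the lazy dog".split()
window = 2   # number of words on each side of the middle word

pairs = []
for i, middle in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            pairs.append((middle, corpus[j]))

print(pairs[:4])   # [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]
```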

## Recurrent neural networks
A neural network with cycles in it (used for sequences).

Can be used for:
- sequence to sequence, e.g. translating English to French
- sequence to label, e.g. sequence classification
- label to sequence, e.g. sentence generation

Example: a fully connected network whose input x is extended by three nodes, into which the hidden layer of the previous step is copied:

![128d6b585d9c1bdaf5a568bc022d8763.png](761f27ac323e4070bf7df80a540a6831.png)

Visual shorthand:
- a rectangle is a vector of nodes
- an arrow feeding into a rectangle, annotated with a weight matrix, means a fully connected transformation
- if a line doesn't have a weight annotation, it's a copy of the input vector
- if two lines flow into each other, their vectors are concatenated

![d32563c1b461ded32d46894a7de2198d.png](44c66910d5c84b08a5a833c3633f9595.png) ![afb898aeecf247a602f622999280ed61.png](92011008806443b784e5f7bc2d3b0a77.png)

Training RNNs:
- provide an input sequence x and a target sequence t
- backpropagation through time:
    - unroll (see the sketch after this list):
        - every step in the sequence is applied in parallel to a copy of the network
        - the recurrent connection flows from the previous copy to the next
        - the whole thing is then a feedforward net (a network without cycles)
        - the hidden layer is initialized to the zero vector

        ![5a22d7340c41f70a48e9815f7576ec3a.png](82d2db6b48d4409582ac544450821b12.png)
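
A rough numpy sketch of the unrolled forward pass (sizes and weights are made up; only the forward direction is shown, not the backward pass of BPTT):

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 4, 3, 5

W = rng.normal(size=(hidden_size, input_size))    # input-to-hidden weights
U = rng.normal(size=(hidden_size, hidden_size))   # hidden-to-hidden (recurrent) weights
b = np.zeros(hidden_size)

x = rng.normal(size=(seq_len, input_size))        # made-up input sequence
h = np.zeros(hidden_size)                         # hidden layer starts as the zero vector

hidden_states = []
for t in range(seq_len):
    # Each unrolled copy sees the input at step t plus the previous copy's hidden state.
    h = np.tanh(W @ x[t] + U @ h + b)
    hidden_states.append(h)

print(np.stack(hidden_states).shape)              # (seq_len, hidden_size) = (5, 3)
```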

Basic RNNs work well, but they don't learn to remember information for a long time.
You can't have a long-term memory for everything; you need to be selective. In order to remember things long term, you need to forget a lot of other stuff (such is life).

## LSTMs
"Long short-term memory".
Selective forgetting and remembering, controlled by learnable "gates". Side note, from now on I'm not "studying", but I'm "selectively forgetting and remembering".

*[tanh]: sigmoid rescaled so its outputs are between -1 and 1

The gating mechanism takes two input vectors, which are combined with sigmoid and tanh activations.
It produces an additive value -- we want to figure out how much of the input to add to some other vector.
The tanh maps the input to the range (-1, 1), limiting the effect of the addition vector.
The sigmoid acts as a selection vector.

![c60c95b58a2975c98df7c548b0585b9b.png](e8795f123d904061a5f2ae90dfdc2c4e.png)

The basic operation of the LSTM is a "cell". There are two recurrent connections between cells: the current output y, and the cell state C.
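
A rough numpy sketch of one cell step, using the standard LSTM formulation (forget/input/output gates; sizes and weights are made up, and I'm not claiming this is exactly the lecture's notation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3

# One weight matrix per gate, acting on the concatenation [previous output y, input x].
Wf, Wi, Wo, Wc = (rng.normal(size=(hidden_size, hidden_size + input_size)) for _ in range(4))
bf = bi = bo = bc = np.zeros(hidden_size)

def lstm_cell(x, y_prev, C_prev):
    z = np.concatenate([y_prev, x])       # combine the two incoming vectors
    f = sigmoid(Wf @ z + bf)              # forget gate: selects what to keep of the old cell state
    i = sigmoid(Wi @ z + bi)              # input gate: selects how much of the candidate to add
    candidate = np.tanh(Wc @ z + bc)      # candidate values, squashed to (-1, 1)
    C = f * C_prev + i * candidate        # additive update of the cell state
    o = sigmoid(Wo @ z + bo)              # output gate
    y = o * np.tanh(C)                    # current output
    return y, C

y, C = lstm_cell(rng.normal(size=input_size), np.zeros(hidden_size), np.zeros(hidden_size))
print(y.shape, C.shape)                   # (3,) (3,)
```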

I don't yet know how much detail we need to know about this, so I'll fill it in later based on exam questions.

The prof's summary: "incredibly powerful language models. Tricky to train, very opaque." Yep, opaque and complicated, indeed.